Author
Name Claire Descombes
Affiliation Universitätsklinik für Neurochirurgie, Inselspital Bern
Degree MSc Statistics and Data Science, University of Bern
Contact claire.descombes@insel.ch

The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.


The base R plotting functions are effective and user-friendly, so we will begin by exploring their features.

Next, we will take a look at the functions offered by ggplot2, a system for declarative graphics creation. You provide the data and specify how to map variables to aesthetics and which graphical primitives to use; ggplot2 then takes care of the details. Although it is slightly more technical than base R, it produces very aesthetically pleasing results.

TTo explore R’s plotting capabilities, we’ll continue working with the NHANES datasets. Specifically, we’ll use the merged data frame that combines the demo, bpx, bmx and smq data sets. If you still have it loaded from Chapter 2, you can use it directly. Otherwise, you can download it from the data_sets folder and import it into R.

# Load the merged_nhanes CSV file
merged_nhanes <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/merged_nhanes.csv")

1 Including plots using base R

1.1 Scatter plots

plot(x = merged_nhanes$RIDAGEYR, y = merged_nhanes$BPXSY1,
     xlab = "Age of participant [years]",
     ylab = "Systolic blood pressure (1st reading) [mmHg]",
     main = "Scatterplot of age vs. systolic BP",
     col = c("steelblue"))

💡 The arguments xlab and ylab allow fyou to define a label for the x and y axes respectively. The argument main defines the main title of the plot.

1.2 Box plots

boxplot(merged_nhanes$BPXSY1 ~ merged_nhanes$RIAGENDR,
        names = c("Male", "Female"),
        xlab = "Gender",
        ylab = "Systolic blood pressure (1st reading) [mmHg]",
        main = "Boxplot of BP by gender",
        col = c("salmon", "skyblue"))

1.3 Bar plots

# We define the age groups in which to calculate the mean systolic blood pressure
age_groups <- cut(x = merged_nhanes$RIDAGEYR, breaks = seq(10, 80, 10))
head(age_groups)
## [1] (20,30] (10,20] (40,50] (10,20] (20,30] (10,20]
## Levels: (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
# We calculate the mean systolic blood pressure by age group
(avg_bp <- tapply(X = merged_nhanes$BPXSY1, INDEX = age_groups, FUN = mean, na.rm = TRUE))
##  (10,20]  (20,30]  (30,40]  (40,50]  (50,60]  (60,70]  (70,80] 
## 110.0211 113.8345 116.6714 120.4581 127.7841 132.3541 138.4203
barplot(avg_bp,
        xlab = "Age group [years]",
        ylab = "Average systolic BP [mmHg]",
        main = "Bar plot of average systolic BP by age group",
        col = "steelblue")

💡 cut() divides the range of the imputed vector into intervals and codes its values according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.

💡 tapply() applies some function to an object, grouped by a list of factors. It uses the following basic syntax: tapply(X, INDEX, FUN, ..), where: X = object to apply a function to, INDEX = vector of factors to group by, FUN = function to apply. It returns a multi-way array containing the values.

1.4 Histograms

hist(merged_nhanes$BMXBMI,
     breaks = 30,
     freq = TRUE,
     col = "steelblue",
     xlab = "Body Mass Index (BMI)",
     main = "Histogram of BMI")

💡 The argument breaks is a vector giving the breakpoints between histogram cells or a function to compute the vector of breakpoints. 💡 The argument freq is a logical: if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted.

1.5 Line plots

# We define the age groups in which to calculate the mean systolic blood pressure
age_groups <- cut(x = merged_nhanes$RIDAGEYR, breaks = seq(10, 80, 10))
head(age_groups)
## [1] (20,30] (10,20] (40,50] (10,20] (20,30] (10,20]
## Levels: (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
# We calculate the mean systolic blood pressure by age group
(avg_bp <- tapply(X = merged_nhanes$BPXSY1, INDEX = age_groups, FUN = mean, na.rm = TRUE))
##  (10,20]  (20,30]  (30,40]  (40,50]  (50,60]  (60,70]  (70,80] 
## 110.0211 113.8345 116.6714 120.4581 127.7841 132.3541 138.4203
# We define the midpoints of the intervals
midpoints <- seq(15, 75, 10)

plot(midpoints, avg_bp, type = "b",
     xlab = "Age (midpoint of group) [years]",
     ylab = "Mean systolic BP [mmHg]",
     main = "Mean systolic BP by age group",
     col = "steelblue")

💡 The argument type is 1-character string giving the type of plot desired. The following values are possible:

  • “p” for points
  • “l” for lines
  • “b” for both points and lines
  • “c” for empty points joined by lines
  • “o” for overplotted points and lines
  • “s” and “S” for stair steps and “h” for histogram-like vertical lines

1.6 Pie charts

(gender_table <- table(merged_nhanes$RIAGENDR))
## 
## Female   Male 
##   3298   3251
pie(x = gender_table,
    labels = c("Female", "Male"),
    col = c("skyblue", "salmon"),
    main = "Gender Distribution")

💡 table() uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

Personally, I would recommend avoiding pie charts, as the human eye is not very good at recognising angle ratios. I favour bar plots, for example.

1.7 Mosaic plots

(smoking_gender_table <- table(merged_nhanes$RIAGENDR, merged_nhanes$SMQ020))
##         
##          Don't know   No Refused  Yes
##   Female          2 1817       0  879
##   Male            1 1220       1 1398
mosaicplot(smoking_gender_table[,c("Yes","No")], # we select only the 'yes/no' answers
           main = "Mosaic plot of smoking status by gender",
           xlab = "Gender",
           ylab = "Smoked at least 100 cigarettes?",
           col = c("steelblue"))

1.8 Forest plots

Base R does not include a built-in function for forest plots, but you can use a package like forestplot for that, or ggplot2 (see below).

library(forestplot)

# Calculate summary statistics by gender
genders <- unique(merged_nhanes$RIAGENDR)
means <- numeric(length(genders))
sds <- numeric(length(genders))
ns <- numeric(length(genders))
lowers <- numeric(length(genders))
uppers <- numeric(length(genders))
gender_labels <- character(length(genders))

# Loop over gender groups to compute stats
for (i in seq_along(genders)) {
  group_data <- merged_nhanes$BPXSY1[merged_nhanes$RIAGENDR == genders[i]]
  group_data <- group_data[!is.na(group_data)]
  means[i] <- mean(group_data)
  sds[i] <- sd(group_data)
  ns[i] <- length(group_data)
  lowers[i] <- means[i] - 1.96 * sds[i] / sqrt(ns[i])
  uppers[i] <- means[i] + 1.96 * sds[i] / sqrt(ns[i])
  gender_labels[i] <- genders[i]
}

# Create the forest plot
forestplot(labeltext = c("Male","Female"),
           mean = means,
           lower = lowers,
           upper = uppers,
           zero = 120,
           grid = TRUE,
           ci.vertices = TRUE,
           xlab = "Systolic BP (95% CI) [mmHg]",
           col = fpColors(box = "steelblue", line = "steelblue"),
           title = "Forest plot of systolic BP by gender")

1.9 Exercises

✏️ Exercise on plotting with base R n°1: Plot the first, second and third systolic blood pressure readings by gender using base R. Display all three plots (1st, 2nd and 3rd readings) in the same output panel.

💡 Hint: Try using the par() function (e.g. par(mfrow=c(2,3)); do not forget to reset it to par(mfrow=c(1,1)) afterwards!)

✏️ Exercise on plotting with base R n°2: Create a histogram showing the systolic blood pressure (1st reading) for each smoking status category (excluding those who refused to provide their smoking status). Set the number of breaks to 20. You should get three different plots. Display them all in the same output panel.

2 Including plots using ggplot2

# Start by installing ggplot2
install.packages("ggplot2")
library(ggplot2) # after installation

# Note that the ggplot2 package is included in some R package collections, e.g. the tidyverse collection, which is specifically designed for data science and also contains other useful packages like dplyr.
install.packages("tidyverse")
library(tidyverse) # after installation

ggplot2 is based on the grammar of graphics — the idea that every graph can be built from the same basic components:

  • a data set,
  • a coordinate system, and
  • geoms: visual marks (like points, lines, bars) that represent data.

Structure of a ggplot2 graph:

How to build a plot:

  1. Start with ggplot(), where you provide the dataset (and optionally the aesthetic mapping).
  2. Add layers, such as:
  • geom_point() (scatter plot)
  • geom_histogram() (histogram)
  • geom_line() (line plot)
  • scales (e.g., scale_y_log10())
  • faceting (e.g., facet_wrap())
  • coordinate systems (e.g., coord_flip())
  1. Define aesthetics using aes(): these describe how variables in the data are mapped to visual properties like: x, y, color, fill, linetype, size, etc.

💡 You can place aes() either in ggplot() (to apply it globally) or inside the individual geom_*() functions (to apply it locally).

The geom function must be chosen according to the type of data examined (two continuous variables, one categorical and one continuous variable, etc.).

2.1 Scatter plots

ggplot(merged_nhanes, aes(x = RIDAGEYR, y = BPXSY1)) +
  geom_point(alpha = 0.5, col = "steelblue") +
  labs(title = "Scatterplot of age vs. systolic BP",
       x = "Age of participant [years]",
       y = "Systolic blood pressure (1st reading) [mmHg]") +
  theme_minimal()

Or equivalently:

ggplot(merged_nhanes) +
  geom_point(aes(x = RIDAGEYR, y = BPXSY1), alpha = 0.5, col = "steelblue") +
  labs(title = "Scatterplot of age vs. systolic BP",
       x = "Age of participant [years]",
       y = "Systolic blood pressure (1st reading) [mmHg]") +
  theme_minimal()

2.2 Box plots

ggplot(data = merged_nhanes, aes(x = factor(RIAGENDR, labels = c("Male", "Female")), y = BPXSY1, fill = factor(RIAGENDR))) +
  geom_boxplot() +
  scale_fill_manual(values = c("salmon", "skyblue")) +
  labs(title = "Box plot of systolic BP by gender",
       x = "Gender",
       y = "Systolic BP (1st reading) [mmHg]") +
  theme_minimal() +
  theme(legend.position = "none")

2.3 Bar plots

# We define the age groups in which to calculate the mean systolic blood pressure
age_groups <- cut(x = merged_nhanes$RIDAGEYR, breaks = seq(10, 80, 10))
head(age_groups)
## [1] (20,30] (10,20] (40,50] (10,20] (20,30] (10,20]
## Levels: (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
# We calculate the mean systolic blood pressure by age group
(avg_bp <- aggregate(BPXSY1 ~ age_groups, data = merged_nhanes, FUN = function(x) mean(x, na.rm = TRUE)))
ggplot(avg_bp, aes(x = age_groups, y = BPXSY1)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Bar plot of mean systolic BP by age group",
       x = "Age group [years]",
       y = "Mean systolic BP [mmHg]") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

💡 aggregate() splits the data into subsets, computes summary statistics for each, and returns the result as a data frame.

2.4 Histograms

ggplot(merged_nhanes, aes(x = BMXBMI)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Histogram of BMI",
       x = "Body Mass Index (BMI)",
       y = "Count") +
  theme_minimal()

💡 A quick reminder that the number of bins in a histogram is extremely important and should be given careful consideration, as the appearance of the data can vary significantly depending on this choice, which could potentially be misleading.

An example with the age of the household reference person:

ggplot(merged_nhanes, aes(x = DMDHRAGE)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  labs(title = "Histogram of age of the household reference person (40 bins)",
       x = "Age [years]",
       y = "Count") +
  theme_minimal()

ggplot(merged_nhanes, aes(x = DMDHRAGE)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  labs(title = "Histogram of age of the household reference person (20 bins)",
       x = "Age [years]",
       y = "Count") +
  theme_minimal()

ggplot(merged_nhanes, aes(x = DMDHRAGE)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "white") +
  labs(title = "Histogram of age of the household reference person (10 bins)",
       x = "Age [years]",
       y = "Count") +
  theme_minimal()

2.5 Line plots

avg_bp$midpoint <- seq(15, 75, 10)  # midpoint of intervals

ggplot(avg_bp, aes(x = midpoint, y = BPXSY1)) +
  geom_line(col = "steelblue") +
  geom_point(col = "steelblue") +
  labs(title = "Mean systolic BP by age group",
       x = "Age (midpoint of group) [years]",
       y = "Mean systolic BP [mmHg]") +
  theme_minimal()

2.6 Pie charts

# Count gender
gender_counts <- as.data.frame(table(merged_nhanes$RIAGENDR))
colnames(gender_counts) <- c("Gender", "n")

# Compute proportions
gender_counts$prop <- gender_counts$n / sum(gender_counts$n)

# Compute y positions for labels (center of each slice)
gender_counts$ypos <- cumsum(gender_counts$prop) - 0.5 * gender_counts$prop

# Resulting data frame
gender_counts
ggplot(gender_counts, aes(x = "", y = prop, fill = Gender)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(y = ypos, label = scales::percent(prop)), color = "white") +
  scale_fill_manual(values = c("salmon", "skyblue")) +
  labs(title = "Pie chart of gender distribution") +
  theme_void()

2.7 Mosaic plots

ggplot2 does not include a built-in function for mosaic plots, but can be extended with the ggmoisaic package for that.

library(ggmosaic)

ggplot(data = merged_nhanes[merged_nhanes$SMQ020 %in% c("Yes","No"),]) +
  geom_mosaic(aes(x = product(RIAGENDR), fill = SMQ020)) +
  scale_fill_manual(values = c("salmon", "skyblue")) +
  labs(x = "Gender", y = "Smoked at least 100 cigarettes?", title = "Mosaic plot of smoking status by gender") +
  theme_minimal() +
  theme(legend.position = "none")

2.8 Forest plots

# Calculate summary statistics by gender
genders <- unique(merged_nhanes$RIAGENDR)
means <- numeric(length(genders))
sds <- numeric(length(genders))
ns <- numeric(length(genders))
lowers <- numeric(length(genders))
uppers <- numeric(length(genders))

# Loop over gender groups to compute stats
for (i in seq_along(genders)) {
  group_data <- merged_nhanes$BPXSY1[merged_nhanes$RIAGENDR == genders[i]]
  group_data <- group_data[!is.na(group_data)]
  means[i] <- mean(group_data)
  sds[i] <- sd(group_data)
  ns[i] <- length(group_data)
  lowers[i] <- means[i] - 1.96 * sds[i] / sqrt(ns[i])
  uppers[i] <- means[i] + 1.96 * sds[i] / sqrt(ns[i])
}

# Create result data frame
(forest_df <- data.frame(
  Gender = genders,
  mean = means,
  sd = sds,
  n = ns,
  lower = lowers,
  upper = uppers
))
ggplot(forest_df, aes(x = mean, y = Gender)) +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.2, color = "steelblue") +
  geom_vline(xintercept = 120, linetype = "dashed") +
  labs(title = "Forest plot of systolic BP by gender",
       x = "Systolic BP (95% CI) [mmHg]",
       y = "") +
  theme_minimal()

2.9 Exercises

✏️ Exercise on plotting with ggplot2: Using the same age groups as in the examples, create a bar plot showing the median BMI by age group. Then produce two further bar plots showing the same data for each gender, and display them in the same output panel, using the same y-axis to allow for better comparison.

💡 Hints: Try using the grid.arrange() function from the gridExtra package (e.g. grid.arrange(plot1, plot2, ncol=2)), and the argument ylim in the ggplot() function.

3 Saving plots

3.1 Base R

We would use the jpeg() function to save a plot as a JPEG image and the png() function to save it as a PNG image. Please note that after the plotting and saving, we need to call the function dev.off() to save the file.

We can specify the resolution we want with arguments width and height. We can also specify the full path of the file we want to save if we don’t want to save it in the current directory.

plotting_directory <- "/home/claire/Documents/GitHub/rforphysicians/docs/plots/"

jpeg(file = paste0(plotting_directory,"hist_BMI.jpeg"), 
     width = 10, 
     height = 6, 
     units = "in",
     res=100)
hist(merged_nhanes$BMXBMI,
     breaks = 30,
     freq = TRUE,
     col = "steelblue",
     xlab = "Body Mass Index (BMI)",
     main = "Histogram of BMI")
dev.off()

png(file= paste0(plotting_directory,"pie_chart_gender.png"), 
    width = 10, 
    height = 6, 
    units="in",
    res=100)
pie(x = gender_table,
    labels = c("Female", "Male"),
    col = c("skyblue", "salmon"),
    main = "Gender Distribution")
dev.off()

3.2 ggplot2

Saving plots created using the ggplot2 package is very easy with the ggsave() function. However, the plot must first be stored in a variable; otherwise, the function will simply save the last plot displayed. The function guesses the type of graphics from the file extension.

We can specify the resolution we want with arguments width and height. We can also specify the full path of the file we want to save if we don’t want to save it in the current directory.

boxplot_syst_bp_by_gender <- 
  ggplot(data = merged_nhanes, aes(x = factor(RIAGENDR, labels = c("Male", "Female")), y = BPXSY1, fill = factor(RIAGENDR))) +
  geom_boxplot() +
  scale_fill_manual(values = c("salmon", "skyblue")) +
  labs(title = "Box plot of systolic BP by gender",
       x = "Gender",
       y = "Systolic BP (1st reading) [mmHg]") +
  theme_minimal() +
  theme(legend.position = "none")

ggsave(filename = "boxplot_syst_bp_by_gender.png",
  plot = boxplot_syst_bp_by_gender,
  bg = 'white',
  width = 10,
  height = 6,
  units = c("in"),
  path = plotting_directory)

3.3 Exercises

✏️ Exercise on saving plots n°1: Save the three box plots from the plotting exercise using base R n°1 as one JPEG image.

✏️ Exercise on saving plots n°2: Save the two bar plots from the plotting exercise using ggplot2 as one PNG image.

4 Solutions to the exercises

☑️ Exercise on plotting with base R n°1

par(mfrow=c(1,3))
boxplot(merged_nhanes$BPXSY1 ~ merged_nhanes$RIAGENDR,
        names = c("Male", "Female"),
        xlab = "Gender",
        ylab = "Systolic blood pressure (1st reading) [mmHg]",
        main = "Boxplot of BP by gender",
        col = c("salmon", "skyblue"))
boxplot(merged_nhanes$BPXSY2 ~ merged_nhanes$RIAGENDR,
        names = c("Male", "Female"),
        xlab = "Gender",
        ylab = "Systolic blood pressure (2nd reading) [mmHg]",
        main = "Boxplot of BP by gender",
        col = c("salmon", "skyblue"))
boxplot(merged_nhanes$BPXSY3 ~ merged_nhanes$RIAGENDR,
        names = c("Male", "Female"),
        xlab = "Gender",
        ylab = "Systolic blood pressure (3rd reading) [mmHg]",
        main = "Boxplot of BP by gender",
        col = c("salmon", "skyblue"))

par(mfrow=c(1,1))

☑️ Exercise on plotting with base R n°2

# We identify the different smoking status
table(merged_nhanes$SMQ040)
## 
##  Every day Not at all    Refused  Some days 
##        878       1210          2        187
par(mfrow=c(3,1))
hist(merged_nhanes[merged_nhanes$SMQ040=="Not at all",]$BPXSY1,
     breaks = 20,
     freq = TRUE,
     col = "darkgreen",
     xlab = "Systolic blood pressure (1st reading) [mmHg]",
     main = "Histogram of systolic BP: not smoking")
hist(merged_nhanes[merged_nhanes$SMQ040=="Some days",]$BPXSY1,
     breaks = 20,
     freq = TRUE,
     col = "orange2",
     xlab = "Systolic blood pressure (1st reading) [mmHg]",
     main = "Histogram of systolic BP: smoking every other day")
hist(merged_nhanes[merged_nhanes$SMQ040=="Every day",]$BPXSY1,
     breaks = 20,
     freq = TRUE,
     col = "coral3",
     xlab = "Systolic blood pressure (1st reading) [mmHg]",
     main = "Histogram of systolic BP: smoking every day")

par(mfrow=c(1,1))

# In case you were wondering why the BP is higher in the group of non-smokers
merged_nhanes$age_group <- cut(merged_nhanes$RIDAGEYR, breaks = seq(10, 80, 10))
table(merged_nhanes[merged_nhanes$SMQ040!="Refused",]$SMQ040, merged_nhanes[merged_nhanes$SMQ040!="Refused",]$age_group)
##             
##              (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
##   Every day       16     165     187     179     172     111      48
##   Not at all       3      84     152     148     228     283     312
##   Some days        5      60      35      30      26      24       7

☑️ Exercise on plotting with ggplot2

# We define the age groups
age_groups <- cut(x = merged_nhanes$RIDAGEYR, breaks = seq(10, 80, 10))
head(age_groups)
## [1] (20,30] (10,20] (40,50] (10,20] (20,30] (10,20]
## Levels: (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
# We calculate the median BMI by age group
(med_bmi <- aggregate(BMXBMI ~ age_groups, data = merged_nhanes, FUN = function(x) median(x, na.rm = TRUE)))
ggplot(med_bmi, aes(x = age_groups, y = BMXBMI)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Bar plot of median BMI by age group",
       x = "Age group [years]",
       y = "Median BMI") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# We calculate the median BMI by age group and by gender
(med_bmi <- aggregate(BMXBMI ~ age_groups + RIAGENDR, data = merged_nhanes, FUN = function(x) median(x, na.rm = TRUE)))
med_bmi_fem <- med_bmi[med_bmi$RIAGENDR == "Female",]
med_bmi_male <- med_bmi[med_bmi$RIAGENDR == "Male",]

plot1 <- ggplot(med_bmi_fem, aes(x = age_groups, y = BMXBMI)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Median BMI, gender = female",
       x = "Age group [years]",
       y = "Median BMI") +
  theme_minimal() +
  ylim(c(0,30)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot2 <- ggplot(med_bmi_male, aes(x = age_groups, y = BMXBMI)) +
  geom_bar(stat = "identity", fill = "salmon") +
  labs(title = "Median BMI, gender = male",
       x = "Age group [years]",
       y = "Median BMI") +
  theme_minimal() +
  ylim(c(0,30)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

library(gridExtra)
grid.arrange(plot1, plot2, ncol=2)

☑️ Exercise on saving plots n°1

plotting_directory <- "/home/claire/Documents/GitHub/rforphysicians/docs/plots/" # your plotting directory here

jpeg(file = paste0(plotting_directory,"boxplots_bp_by_gender.jpeg"), 
     width = 10, 
     height = 6, 
     units = "in",
     res=100)
par(mfrow=c(1,3))
boxplot(merged_nhanes$BPXSY1 ~ merged_nhanes$RIAGENDR,
        names = c("Male", "Female"),
        xlab = "Gender",
        ylab = "Systolic blood pressure (1st reading) [mmHg]",
        main = "Boxplot of BP by gender",
        col = c("salmon", "skyblue"))
boxplot(merged_nhanes$BPXSY2 ~ merged_nhanes$RIAGENDR,
        names = c("Male", "Female"),
        xlab = "Gender",
        ylab = "Systolic blood pressure (2nd reading) [mmHg]",
        main = "Boxplot of BP by gender",
        col = c("salmon", "skyblue"))
boxplot(merged_nhanes$BPXSY3 ~ merged_nhanes$RIAGENDR,
        names = c("Male", "Female"),
        xlab = "Gender",
        ylab = "Systolic blood pressure (3rd reading) [mmHg]",
        main = "Boxplot of BP by gender",
        col = c("salmon", "skyblue"))
dev.off()
par(mfrow=c(1,1))

☑️ Exercise on saving plots n°2

plotting_directory <- "/home/claire/Documents/GitHub/rforphysicians/docs/plots/" # your plotting directory here

bar_plot_med_bmi_by_gender <- grid.arrange(plot1, plot2, ncol=2)
ggsave(filename = "bar_plot_med_bmi_by_gender.png",
  plot = bar_plot_med_bmi_by_gender,
  bg = 'white',
  width = 10,
  height = 6,
  units = c("in"),
  path = plotting_directory)

5 NHANES data sets

Here are listed some of the variables from the NHANES data sets.

5.1 Demographics (DEMO_G)

Variable Name Description
RIDAGEYR Participant’s age in years
RIAGENDR Participant’s gender
DMDHHSIZ Total number of people in the household
DMDHHSZA Number of children aged 5 or younger in the household
DMDHRAGE Age of the household reference person
DMDHRMAR Marital status of the household reference person
DMDHRGND Gender of the household reference person
AIALANGA Language of the interview

5.2 Blood Pressure (BPX_G)

Variable Name Description
BPXSY1 Systolic blood pressure (first reading) in mm Hg
BPXSY2 Systolic blood pressure (second reading) in mm Hg
BPXSY3 Systolic blood pressure (third reading) in mm Hg
BPXDI1 Diastolic blood pressure (first reading) in mm Hg
BPXDI2 Diastolic blood pressure (second reading) in mm Hg
BPXDI3 Diastolic blood pressure (third reading) in mm Hg
BPXPULS Pulse rate (beats per minute)
BPXPLS 60-second pulse (30-second pulse multiplied by 2)
BPXPTY Pulse type (e.g., regular or irregular)
BPXML1 Maximum inflation level (mm Hg)

5.3 Body Measures (BMX_G)

Variable Name Description
BMXWT Weight (kg)
BMXHT Standing height (cm)
BMXBMI Body Mass Index (kg/m²)
BMXWAIST Waist circumference (cm)
BMXHIP Hip circumference (cm)
BMXARML Upper arm length (cm)
BMXARMC Upper arm circumference (cm)
BMXLEG Upper leg length (cm)

5.4 Smoking Questionnaire (SMQ_G)

Variable Name Description
SMQ020 Smoked at least 100 cigarettes in life
SMQ040 Do you now smoke cigarettes?
SMQ050Q Average number of cigarettes smoked per day
SMD030 Age when first smoked cigarettes regularly
SMD070 Age when last smoked cigarettes regularly
SMQ680 Smoked cigars in the past 5 days
SMQ690 Smoked pipes in the past 5 days
SMQ700 Smoked chewing tobacco in the past 5 days
SMQ710 Smoked snuff in the past 5 days

References

Alexander Henzi. 2021. “Programming and Data Analysis with R.” Lecture notes.
Burns, Patrick. n.d. The R Inferno. Accessed May 8, 2025. https://www.burns-stat.com/documents/books/the-r-inferno/.
ChatGPT.” n.d. Accessed January 26, 2025. https://chatgpt.com.
Christopher J. Endres. 2025. “Introducing nhanesA.” https://cran.r-project.org/web/packages/nhanesA/vignettes/Introducing_nhanesA.html.
“Create Elegant Data Visualisations Using the Grammar of Graphics.” n.d. Accessed January 26, 2025. https://ggplot2.tidyverse.org/.
David, Author. 2016. BIRT Joins.” MBSE Chaos. https://mbsechaos.wordpress.com/2016/05/24/birt-joins/.
Elena Kosourova. n.d. RStudio Tutorial for Beginners: A Complete Guide.” Accessed January 26, 2025. https://www.datacamp.com/tutorial/r-studio-tutorial.
Grolemund, Hadley Wickham and Garrett. n.d. R for Data Science. Accessed May 8, 2025. https://r4ds.had.co.nz/introduction.html.
Mayer, Michael. 2025. “Mayer79/Statistical_computing_material.” https://github.com/mayer79/statistical_computing_material.
Patrick Burns. n.d. Impatient R. Accessed May 8, 2025. https://www.burns-stat.com/documents/tutorials/impatient-r/.
“Synthetic Dataset for AI in Healthcare.” n.d. Accessed May 9, 2025. https://www.kaggle.com/datasets/smmmmmmmmmmmm/synthetic-dataset-for-ai-in-healthcare.
“The Comprehensive R Archive Network.” n.d. Accessed January 26, 2025. https://stat.ethz.ch/CRAN/.
W. N. Venables, D. M. Smith and the R Core Team. n.d. “An Introduction to R.” Accessed May 8, 2025. https://cran.r-project.org/doc/manuals/r-release/R-intro.html.
Wickham, Hadley. n.d. Advanced R. Accessed May 8, 2025. https://adv-r.hadley.nz/introduction.html.